Probability basics
Why Probability Distributions?
Inferential models depend on probability distributions:
estimation - is not deterministic, and so admits the unknown via disturbances \(\epsilon\). We characterize those unknowns as following some probability distribution.
inference - is essential because of the unknowns. Inference is possible because of the probability distribution characterizing the unknowns.
Probability Distributions
Probability distributions permit statements about relative scale, frequency, and uncertainty.
If the relation between \(x\) and \(y\) is measured by \(\widehat{\beta}\), we need to know whether or not to take \(\widehat{\beta}\) seriously - is it representative of the population relationship between \(x\) and \(y\)?
Things we want to know about \(x\)
- \(Pr(X = x)\) - the probability of observing any particular value of \(x\); these together comprise the density.
- \(Pr(X \leq x)\) - the probability of observing values up to and including \(x\) (or any range of \(x\)). (CDF)
- central tendency of \(x\) - mean, median, mode.
- dispersion - variance, or standard deviation (the average difference of any observation from the expected value).
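A minimal sketch of these four quantities in R, assuming a hypothetical discrete variable: the number of successes in 10 trials with success probability 0.5. The variable and its parameters are illustrative only and do not come from the course data.

```r
# Hypothetical example: X = number of successes in 10 trials, each with probability 0.5
x_vals <- 0:10
pmf <- dbinom(x_vals, size = 10, prob = 0.5)   # Pr(X = x) for every x
cdf <- pbinom(x_vals, size = 10, prob = 0.5)   # Pr(X <= x) for every x

# Central tendency and dispersion computed from the probabilities
ev <- sum(x_vals * pmf)                  # expected value (mean)
variance <- sum((x_vals - ev)^2 * pmf)   # variance
std_dev <- sqrt(variance)                # standard deviation

pmf[x_vals == 3]   # Pr(X = 3)
cdf[x_vals == 3]   # Pr(X <= 3)
c(mean = ev, variance = variance, sd = std_dev)
```

The CDF value at 3 is just the running sum of the first four probabilities, which is why the two descriptions carry the same information.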
Types of variables
Variables are either
discrete - observations match to integers; all possible values are clearly distinguishable; not divisible. E.g., number of protests in DC this year; an individual’s sex; Polity score.
continuous - observations can take on any real value between boundaries (sometimes \(-\infty, +\infty\)); infinitely divisible. E.g., household income, GDP per capita.
Discrete variables
May be of two types or levels of measurement:
nominal - categories are distinct, but lack order. E.g., religion = Hindu, Muslim, Catholic, Protestant, Jewish. Binary variables are nominal, e.g., Sex = male (0), female (1); do you have blue eyes? yes (0), no (1).
ordinal - take on countable values, increasing/decreasing in some dimension. E.g., Polity -10, -9, \(\ldots\) 0, 1, \(\ldots\) 9, 10 increasing in democracy; survey responses “Do you feel safe traveling abroad?” Not at all; sometimes; yes, completely.
Continuous variables
Can be of two types (levels of measurement):
interval - 1 unit increase has same meaning across the scale (i.e., the intervals are the same); e.g., degrees Celsius or Fahrenheit.
ratio - intervals but also has a meaningful absolute zero; e.g., weight in pounds; zero lbs indicates the absence of weight; Venmo balance = zero, means actually no money; degrees Kelvin. Duration of a war in days - zero days means there’s no war.
Levels of measurement
These four levels of measurement can be ordered by the amount of information a variable contains, least to most:
nominal
ordinal
interval
ratio
We can turn higher levels into lower levels (at the cost of sacrificing information), but not the opposite.
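As a rough illustration of moving down the levels of measurement, the sketch below collapses a ratio-level variable into an ordinal one and then into a binary one. The income values and cut points are hypothetical, chosen only for the example.

```r
# Hypothetical ratio-level variable: household income in dollars
income <- c(12000, 35000, 58000, 91000, 150000)

# Ratio -> ordinal: ordered categories keep the ranking but lose the distances
income_ord <- cut(income,
                  breaks = c(0, 30000, 75000, Inf),
                  labels = c("low", "middle", "high"),
                  ordered_result = TRUE)

# Ordinal -> binary: loses the ordering among the collapsed categories
income_high <- as.integer(income_ord == "high")

table(income_ord)
income_high
```

Nothing in `income_ord` or `income_high` lets us recover the original dollar amounts, which is the sense in which the conversion only goes one way.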
Levels of Measurement and Models
In general, the level of measurement of \(y\) (so the type and amount of information in a variable) shapes what type of model is appropriate.
discrete variables usually require statistics/models in the Binomial family (for our purposes, mostly MLE models like the Logit.)
continuous variables usually require statistics/models in the Normal/Gaussian family (for our purposes, mostly OLS models like the linear regression.)
Distributions
PDF and CDF
Probability distributions can be described by
Probability Density Function (PDF) which maps \(X\) onto the probability space describing the probabilities of every value of \(X\), \(x\).
Cumulative Distribution Function (CDF) which maps \(X\) onto the probability space describing the probability \(X\) is less than some value, \(x\).
PDF (Density)
While establishing the probability of a value of a discrete variable is possible, establishing the probability of a particular value of a continuous variable is not.
Instead, the continuous PDF describes the instantaneous rate of change in \(Pr(X \leq x)\), the CDF, at every value of \(x\); probabilities come from the area under the PDF over a range of \(x\).
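To illustrate why a continuous density is read as a rate of change rather than as a probability, the sketch below compares the slope of the CDF with the value of the PDF at one point. The standard normal and the point 0.5 are assumptions made only for the example.

```r
# Standard normal used purely as an illustration
x <- 0.5
h <- 1e-5

# Slope of the CDF at x, by central finite difference
cdf_slope <- (pnorm(x + h) - pnorm(x - h)) / (2 * h)

# Density at x
density_at_x <- dnorm(x)

c(cdf_slope = cdf_slope, density = density_at_x)

# Probability over a range is the area under the PDF,
# which equals a difference in the CDF
prob_from_cdf <- pnorm(1) - pnorm(-1)
prob_from_area <- integrate(dnorm, lower = -1, upper = 1)$value
c(prob_from_cdf, prob_from_area)
```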
PDF plots
CDF
CDF plots
Notation
\(f(x)\) denotes the PDF of \(x\).
\(F(x)\) denotes the CDF of \(x\).
Substitute either the appropriate name, symbol, or function for \(F\) or \(f\) to indicate the distribution.
Lower case Greek letters denote PDFs: e.g. \(f(x)= \phi(x)\) denotes the normal PDF.
Upper case Greek letters denote CDFs: e.g. \(F(x)=\Phi(x)\) denotes the normal CDF.
Discrete Probability Distributions
Bernoulli
Suppose a binary variable \(x\) that takes on only the values of zero and one (\(x \in \{0, 1\}\)):
\[ \begin{aligned} Pr(X=1) &= \pi \\ Pr(X=0) &= 1-Pr(X=1) \\ &= 1-\pi \end{aligned} \]
The PDF is:
\[ f(x) = \begin{cases} \pi & \text{if } x=1 \\ 1-\pi & \text{if } x=0 \end{cases} \]
or
\[f(x) = \pi^{x}(1-\pi)^{1-x} \]
The CDF is:
\[F(x) =\sum_{t \leq x} f(t) \]
and the expected value is
\[ \begin{aligned} E(x) &= \sum_{x} x f(x) \\ &= (1)(\pi)+(0)(1-\pi) \\ &= \pi \end{aligned} \]
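A small sketch of the Bernoulli PDF and its expected value in R. The value \(\pi = 0.3\), the seed, and the helper name `bern_pdf` are assumptions for illustration; `dbinom()` with `size = 1` is R's built-in equivalent.

```r
set.seed(501)   # arbitrary seed, for reproducibility only
p <- 0.3        # hypothetical value of pi

# Bernoulli PDF written out from the formula pi^x * (1 - pi)^(1 - x)
bern_pdf <- function(x, p) p^x * (1 - p)^(1 - x)
bern_pdf(c(0, 1), p)
dbinom(c(0, 1), size = 1, prob = p)   # same values from the built-in

# Expected value from the definition: sum over x of x * f(x)
ev <- sum(c(0, 1) * bern_pdf(c(0, 1), p))
ev

# The mean of many simulated draws should be close to pi
draws <- rbinom(100000, size = 1, prob = p)
mean(draws)
```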
Binomial
Bernoulli is important because it is the foundation for a lot of other distributions, including the binomial distribution. The binomial describes the number of successes (where \(x=1\) is a “success”) in a set of \(n\) independent Bernoulli trials. So, the binomial gives the probability of observing \(x\) successes (“ones”) in \(n\) independent Bernoulli trials with identical probabilities, \(\pi\).
There are two essential parts to the binomial PDF - the probability of success, and the number of ways a success can occur.
The probability of success is the Bernoulli probability:
\[ f(x) = \pi^{x}(1-\pi)^{1-x} \]
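and the number of ways \(x\) successes can occur in \(n\) trials (the number of distinct “n-tuples” containing \(x\) ones) is
\[ \begin{pmatrix} n \\ x \end{pmatrix} = \frac{n!}{x!(n-x)!} \]
Notice the notation for the binomial coefficient, read “\(n\) choose \(x\)”.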
The PDF combines these:
\[ f(x)= \begin{pmatrix} n \\ x \end{pmatrix} \pi^{x}(1-\pi)^{n-x} \]
There are \(n\) trials, \(x\) is the number of successes, and \(n-x\) is the number of failures. Each event in the n-tuple arises with the Bernoulli probability.
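The two parts of the binomial PDF can be checked numerically. This is a sketch under assumed values (\(n = 10\), \(\pi = 0.3\)); `choose()` and `dbinom()` are base R functions.

```r
n <- 10     # hypothetical number of trials
p <- 0.3    # hypothetical success probability (pi)
x <- 0:n    # every possible number of successes

# PDF built from its two parts:
# number of ways to get x successes, times the probability of each such sequence
binom_pdf <- choose(n, x) * p^x * (1 - p)^(n - x)

# Same values from R's built-in binomial density
all.equal(binom_pdf, dbinom(x, size = n, prob = p))

sum(binom_pdf)       # the probabilities sum to 1
sum(x * binom_pdf)   # expected number of successes, n * pi
```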
Binomial family distributions
the geometric distribution describes repeated Bernoulli trials with probability of success \(\pi\), until the first success.
the negative binomial describes the number of Bernoulli failures prior to the \(r\)th success - it can be thought of as a counting process up to the \(r\)th success; the geometric is the special case \(r=1\).
the Poisson describes counts of successes across a large number of Bernoulli trials where \(\pi\) for any particular trial is very small.
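The relationships among these distributions can be sketched with base R's density functions. The probabilities and trial counts below are hypothetical. Note that R's `dgeom()` and `dnbinom()` are parameterized in terms of the number of failures before a success, which is an R convention rather than something stated above.

```r
p <- 0.2   # hypothetical success probability

# Geometric: probability of 0, 1, 2, 3 failures before the first success
geom_probs <- dgeom(0:3, prob = p)

# Negative binomial stopping at the first success (size = 1) is the same distribution
nbinom_probs <- dnbinom(0:3, size = 1, prob = p)
all.equal(geom_probs, nbinom_probs)

# Poisson as an approximation to the binomial: many trials, very small pi,
# with the Poisson mean set to n * pi
n <- 10000
p_small <- 0.0003
binom_probs <- dbinom(0:10, size = n, prob = p_small)
pois_probs <- dpois(0:10, lambda = n * p_small)
max(abs(binom_probs - pois_probs))   # largest pointwise gap between the two
```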
Continuous Probability Distributions
The Normal Distribution
The Normal distribution is the most widely used distribution in the social sciences.
The normal seems to “fit” a lot of the variables social scientists measure and use in models.
Estimation techniques we often employ assume the disturbance term is normally distributed; one result is the coefficients are either normally distributed or t-distributed.
Normal PDF
The normal PDF: \[ f(y_{i})=\frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left[\frac{-(y_{i}-\mu)^{2}}{2\sigma^{2}}\right] \]
where two parameters, \(\mu\) and \(\sigma^{2}\) describe the location and shape of the distribution, the mean and variance respectively; we indicate a normally distributed variable and its parameters as
\[ Y \sim \text{Normal}(\mu,\sigma^{2}) \nonumber \]
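As a check on the formula, the sketch below codes the normal PDF by hand and compares it with R's `dnorm()`. The function name `normal_pdf` and the parameter values are assumptions for illustration. In the code, `pi` is the constant 3.14159..., not the Bernoulli probability.

```r
# Normal PDF written directly from the formula (pi here is the constant 3.14159...)
normal_pdf <- function(y, mu, sigma2) {
  1 / sqrt(2 * pi * sigma2) * exp(-(y - mu)^2 / (2 * sigma2))
}

y <- seq(-3, 7, by = 0.5)
mu <- 2        # hypothetical mean
sigma2 <- 4    # hypothetical variance

# dnorm() takes the standard deviation, not the variance
all.equal(normal_pdf(y, mu, sigma2), dnorm(y, mean = mu, sd = sqrt(sigma2)))

# The density integrates to 1 over the real line
total_area <- integrate(normal_pdf, lower = -Inf, upper = Inf,
                        mu = mu, sigma2 = sigma2)$value
total_area
```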
library(ggplot2)

# Path is specific to the author's machine; point it at your copy of the Gapminder data
gap <- read.csv("/Users/dave/documents/teaching/501/2024/slides/L1-data/data/gapminder.csv")

# Histogram (on the density scale) with a kernel density overlay
ggplot(gap, aes(x = alcohol_consumption_per_adult_15plus_litres)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666") +
  labs(x = "Liters of Booze", y = "Density", caption = "Alcohol Consumption, Gapminder") +
  ggtitle("Density - Alcohol Consumption per Adult (Liters)")

# Normal density evaluated at each observation, using the sample mean and sd
alc <- gap$alcohol_consumption_per_adult_15plus_litres
gap$nalc <- dnorm(alc, mean = mean(alc, na.rm = TRUE), sd = sd(alc, na.rm = TRUE))

# Same plot with the fitted Normal PDF added as a dashed line
ggplot(gap, aes(x = alcohol_consumption_per_adult_15plus_litres)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666") +
  geom_line(aes(y = nalc), linetype = "longdash", linewidth = 1) +
  labs(x = "Liters of Booze", y = "Density", caption = "Alcohol Consumption, Gapminder") +
  ggtitle("Alcohol Consumption per Adult (Liters), Normal PDF")

Standard Normal
A useful special case of the Normal is the standard normal, \(z \sim \text{Normal}(0,1)\). The PDF for the standard normal is given by
\[ \phi(z)=\frac{1}{\sqrt{2 \pi }} \exp \left[\frac{-z^{2}}{2}\right] \]
where the parameters themselves drop out since we’d be subtracting a mean of zero and dividing/multiplying by a variance of 1. The standard normal CDF is denoted \(\Phi(z)\); the standard normal PDF is denoted \(\phi(z)\).
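A brief sketch of how a normal variable maps onto the standard normal by subtracting the mean and dividing by the standard deviation. The mean, standard deviation, and values of `y` are hypothetical, and `phi` is a hand-coded helper written for this example.

```r
mu <- 2       # hypothetical mean
sigma <- 3    # hypothetical standard deviation
y <- c(-1, 2, 5, 8)

# Standardize: subtract the mean, divide by the standard deviation
z <- (y - mu) / sigma
z

# CDF values agree on either scale
all.equal(pnorm(y, mean = mu, sd = sigma), pnorm(z))

# Densities agree once the standard normal density is rescaled by sigma
all.equal(dnorm(y, mean = mu, sd = sigma), dnorm(z) / sigma)

# Standard normal PDF from the formula, compared with dnorm()
phi <- function(z) 1 / sqrt(2 * pi) * exp(-z^2 / 2)
all.equal(phi(z), dnorm(z))
```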
Normal PDFs with different moments
The following plot shows three normal PDFs with different means and variances. The PDFs are centered at -1, 0, and 1, and have variances of .5, 1, and 1.5 respectively. Thinking in terms of uncertainty (foreshadowing a bit), imagine that larger variance indicates more uncertainty about the mean, and smaller variance indicates less uncertainty.
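One way such a plot could be drawn is sketched below, using the means and variances given in the text. It uses base R graphics rather than ggplot2 to stay dependency-free; the grid of `x` values and the styling are assumptions, and this is not necessarily how the original figure was produced.

```r
# Means and variances as described in the text
mus <- c(-1, 0, 1)
vars <- c(0.5, 1, 1.5)

x <- seq(-5, 5, length.out = 500)

# One column of densities per distribution; dnorm() takes sd = sqrt(variance)
dens <- sapply(seq_along(mus), function(j) {
  dnorm(x, mean = mus[j], sd = sqrt(vars[j]))
})

# Overlay the three curves with different line types
matplot(x, dens, type = "l", lty = 1:3, col = "black",
        xlab = "x", ylab = "Density",
        main = "Normal PDFs with different moments")
legend("topright", lty = 1:3,
       legend = paste0("mean = ", mus, ", variance = ", vars))
```

The curve with the smallest variance is the tallest and narrowest, and the curve with the largest variance is the flattest.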
Why Distributions Matter to Models
Models
Probability models include unobserved disturbances; we make assumptions about the distributions of those errors, \(\epsilon\).
What we assume about the unobservables is always informed by what we know about the observables, mainly \(y\).
A useful way to describe or summarize \(y\) is to characterize its observed distribution.
The distribution of \(y\) informs our assumption about the distribution of \(\epsilon\).
Why this matters to OLS
Linear regression is such an inferential model; we have a number of sources of uncertainty. We represent that uncertainty in the model via the disturbance term, \(\epsilon\).
In order to know things about \(\epsilon\) and other parts of the model, we assume that the disturbances are normally distributed; \(y\), conditional on \(x\), should be normal too.
Normality and Centrality
We rely on stuff being normally distributed, and central tendency being meaningful. The Central Limit Theorem facilitates this.
Central Limit Theorem
Suppose \(n\) random variables - call them \(X_1, X_2 \ldots X_n\). These variables are not identically distributed; some are discrete, some continuous. Each variable \(X_i\) has mean \(\bar{X}_i\).
\[ \frac{1}{n}\sum\limits_{i=1}^{n} \bar{X}_i \sim N (\mu, \sigma^2) \quad \text{as } n \rightarrow \infty \]
As \(n\) approaches infinity, the average of the means \(\bar{X}_i\) is distributed Normal, with mean \(\mu\). We could accomplish the same thing by repeated sampling of a single variable.
Simulating the Central Limit Theorem
The following app simulates the Central Limit Theorem. You can select a distribution, sample size, and number of simulations. The app will plot the distribution of the means of the samples. You’ll see that, regardless of the distribution the means are drawn from, the distribution of the means is normal as the sample size increases.
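The same idea can be sketched in a few lines of R. This is not the app's code: the exponential distribution, the sample sizes, the number of simulations, and the helper name `sim_means` are all assumptions chosen for illustration.

```r
set.seed(501)   # arbitrary seed, for reproducibility only

# Draw n_sims samples of size n from a skewed (exponential) distribution
# and return the mean of each sample
sim_means <- function(n, n_sims = 5000) {
  replicate(n_sims, mean(rexp(n, rate = 1)))
}

means_small <- sim_means(n = 2)
means_large <- sim_means(n = 100)

# The population mean is 1; the sd of the sample mean should be near 1 / sqrt(n)
c(mean = mean(means_large), sd = sd(means_large), theory_sd = 1 / sqrt(100))

# Histograms: still skewed for n = 2, close to bell-shaped for n = 100
par(mfrow = c(1, 2))
hist(means_small, breaks = 40, main = "n = 2", xlab = "Sample mean")
hist(means_large, breaks = 40, main = "n = 100", xlab = "Sample mean")
par(mfrow = c(1, 1))
```

Swapping `rexp()` for another generator such as `runif()` or `rbinom()` is a quick way to see that the shape of the underlying distribution matters less and less as the sample size grows.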
CLT - Why is this valuable?
If we assume repeated sampling and/or infinitely large samples, we can assume normality.
We know the properties of the Normal intimately well. We can use this knowledge to evaluate where some value of \(x\) lies on the Normal CDF, that is, the probability of observing a value less than or equal to \(x\), and to evaluate the density at that value (on the PDF).
As we’ll see, \(\widehat{\beta}\) is normally distributed in the OLS model (it is in MLE as well). The CLT and what we know about normality facilitate inference.
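A rough simulation of the sampling distribution of \(\widehat{\beta}\) under repeated sampling. The data-generating process (intercept 1, slope 2, standard normal disturbances), the sample size, and the number of replications are all hypothetical choices made for this sketch.

```r
set.seed(501)   # arbitrary seed, for reproducibility only

# Hypothetical data-generating process: y = 1 + 2x + epsilon, epsilon ~ Normal(0, 1)
true_beta <- 2
n <- 50
x <- runif(n, 0, 10)   # x is held fixed across repeated samples

# Draw new disturbances, re-estimate the model, and return the slope estimate
one_beta_hat <- function() {
  y <- 1 + true_beta * x + rnorm(n, mean = 0, sd = 1)
  coef(lm(y ~ x))[["x"]]
}

beta_hats <- replicate(2000, one_beta_hat())

# Estimates center on the true slope, with spread close to the theoretical standard error
theory_se <- 1 / sqrt(sum((x - mean(x))^2))
c(mean = mean(beta_hats), sd = sd(beta_hats), theory_se = theory_se)

# Share of estimates within 1.96 standard errors of the truth,
# compared with the corresponding area under the Normal
coverage <- mean(abs(beta_hats - true_beta) < 1.96 * theory_se)
c(simulated = coverage, normal = pnorm(1.96) - pnorm(-1.96))
```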
Inference
Inference is our effort to measure and characterize our uncertainty about the model and its parameters.
uncertainty is the most important thing we estimate in inferential models.
characterizing uncertainty relies on probability theory.